Artificial intelligence (AI) is an area of computer science that emphasizes the creation of intelligent machines that work and react like humans.
Artificial intelligence was born in the 1950s, when a handful of pioneers from the nascent field of computer science started asking whether computers could be made to “think”.
♟ ♟ Early Chess Programs ♟ ♟
First example of symbolic AI
Only involved hard-coded rules crafted by programmers
Human-level artificial intelligence = programmers handcraft a sufficiently large set of explicit rules
Could a computer go beyond “what we know how to order it to perform” and learn on its own how to perform a specified task?
Could a computer surprise us? Rather than programmers crafting data-processing rules by hand, could a computer automatically learn these rules by looking at data?
1. What is Machine Learning?
The area of computational science that focuses on analyzing and interpreting patterns and structures in data to enable learning, reasoning, and decision making outside of human interaction.
Learning the (statistical) patterns that govern a phenomenon.
Machine Learning == Statistical Learning
General programming vs. Machine Learning
An ML system is trained rather than programmed.
It is presented with many examples relevant to a task in order to find the statistical structure in these examples that eventually allows the system to come up with rules for automating the task.
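To make the contrast concrete, here is a minimal sketch in plain Python. The numbers, the function names and the brute-force threshold search are all illustrative assumptions, not an algorithm taken from a library: the "programmed" rule uses a threshold picked by a human, while the "trained" rule picks the threshold that makes the fewest mistakes on the examples.

```python
# Toy examples (made-up data): living area in square feet -> is the house "expensive"?
areas = [600, 800, 950, 1200, 1500, 1800, 2100, 2500]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def programmed_rule(area):
    """Classic programming: the threshold is hard-coded by a human."""
    return 1 if area > 1000 else 0


def train_threshold(xs, ys):
    """'Training': keep the candidate threshold with the fewest errors on the examples."""
    best_threshold, best_errors = None, len(xs) + 1
    for candidate in sorted(xs):
        errors = sum((1 if x > candidate else 0) != y for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


learned_threshold = train_threshold(areas, labels)


def learned_rule(area):
    """Same form of rule, but the threshold comes from the data."""
    return 1 if area > learned_threshold else 0


programmed_errors = sum(programmed_rule(x) != y for x, y in zip(areas, labels))
learned_errors = sum(learned_rule(x) != y for x, y in zip(areas, labels))
print(learned_threshold, programmed_errors, learned_errors)
```

On this toy data the hand-written threshold misclassifies one house, whereas the threshold found from the examples fits them all.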
Machine Learning Taxonomy
Supervised Learning
Develop a predictive model based on both input and output data.
Classification vs Regression
Unsupervised Learning
Group and interpret data based only on input data.
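The difference shows up directly in what is passed to fit. The sketch below uses tiny synthetic arrays (an assumption made for illustration) and three standard Sklearn estimators: the two supervised models receive both X and y, while the clustering model only receives X.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic inputs: one feature, two well-separated groups of samples
X = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0]])

# Supervised, regression: the target is a continuous number
y_continuous = np.array([10.0, 15.0, 20.0, 80.0, 85.0, 90.0])
regressor = LinearRegression().fit(X, y_continuous)

# Supervised, classification: the target is a discrete class
y_class = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(X, y_class)

# Unsupervised: only X is given, the algorithm groups the samples on its own
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(regressor.predict([[5.0]]))   # a continuous value
print(classifier.predict([[1.0]]))  # a class label
print(clusterer.labels_)            # one cluster id per sample
```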
Jargon
The features can also be referred to as the input, the X’s, the variables or covariates.
The target can also be referred to as the output, the y, the label, the class or the outcome.
The samples can also be referred to as the rows or the observations.
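In code, this vocabulary maps onto array shapes. A small sketch with illustrative values (the two feature columns are an assumption chosen for the example): X has one row per sample and one column per feature, and y has one value per sample.

```python
import numpy as np

# Illustrative mini-dataset: 4 houses (samples), 2 features each
X = np.array([
    [1710, 3],  # living area (sq ft), number of bedrooms
    [1262, 3],
    [1786, 3],
    [1717, 4],
])
y = np.array([208500, 181500, 223500, 140000])  # target: sale price

n_samples, n_features = X.shape
print(n_samples, n_features)  # rows are samples, columns are features
print(y.shape)                # one target value per sample
```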
ML stages
ML domains and tasks
2. Scikit-learn
Scikit-learn (Sklearn) is a Machine Learning library that provides data preprocessing, modeling, and model selection tools.
There are many ways to import modules and classes in notebooks, but there is a best practice.
🚫
import sklearn  # import of entire library
model = sklearn.linear_model.LinearRegression()  # must type library and module prefix every time

🚫
import sklearn.linear_model  # import of entire module
model = sklearn.linear_model.LinearRegression()  # must type library and module prefix every time

🚫
from sklearn import linear_model  # import of entire module
model = linear_model.LinearRegression()  # must type module prefix every time

🚫
from sklearn.linear_model import *  # import of entire module
model = LinearRegression()
“Explicit is better than implicit” - The Zen of Python
✅
from sklearn.linear_model import LinearRegression  # explicit class import from module
model = LinearRegression()  # => we know where this object comes from
3. Linear Modeling with Sklearn
Consider the following dataset (download here). It is a collection of houses and their characteristics, along with their sale price. The full documentation of the dataset is available here.
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv")
data.head()
Id MSSubClass MSZoning ... SaleType SaleCondition SalePrice
0 1 60 RL ... WD Normal 208500
1 2 20 RL ... WD Normal 181500
2 3 60 RL ... WD Normal 223500
3 4 70 RL ... WD Abnorml 140000
4 5 60 RL ... WD Normal 250000
[5 rows x 85 columns]
Let’s start simple by modeling the SalePrice (y) according to the GrLivArea (X).
import matplotlib.pyplot as plt

# Plot Living area vs Sale price
plt.scatter(data['GrLivArea'], data['SalePrice'])

# Labels
plt.xlabel("Living area")
plt.ylabel("Sale price")
plt.show()
Training
Training a Linear Regression model with Sklearn LinearRegression
# Import the model
from sklearn.linear_model import LinearRegression

# Instantiate the model (💡 in Sklearn often called "estimator")
model = LinearRegression()

# Define X and y
X = data[['GrLivArea']]
y = data['SalePrice']

# Train the model on the data
model.fit(X, y)
LinearRegression()
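Each Sklearn estimator has a default metric that its score method returns. For example, a classifier like LogisticRegression defaults to accuracy, whereas a regressor like LinearRegression defaults to R².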
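The default metrics can be checked against the explicit functions in sklearn.metrics. The sketch below runs on synthetic data (the generated areas, prices and the "large house" label are assumptions for illustration) and compares score with r2_score for a regressor and with accuracy_score for a classifier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

rng = np.random.default_rng(0)
# Synthetic living areas, in thousands of square feet
X = rng.uniform(0.5, 3.0, size=(100, 1))

# Regression target: a noisy linear function of the area
y_price = 100_000 * X[:, 0] + 20_000 + rng.normal(0, 30_000, size=100)
regressor = LinearRegression().fit(X, y_price)
r2_from_score = regressor.score(X, y_price)
r2_from_metric = r2_score(y_price, regressor.predict(X))

# Classification target: is the house larger than 1500 sq ft?
y_large = (X[:, 0] > 1.5).astype(int)
classifier = LogisticRegression().fit(X, y_large)
acc_from_score = classifier.score(X, y_large)
acc_from_metric = accuracy_score(y_large, classifier.predict(X))

print(r2_from_score, r2_from_metric)
print(acc_from_score, acc_from_metric)
```

Both pairs of numbers should match, since score is a shortcut for the estimator's default metric.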
Predicting
The trained model can be used to predict new data
# Predict on new data
model.predict([[1000]])
array([127113.39664561])
👉 A house with a living area of 1000 \(ft^2\) has a predicted value of about $127k.
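A fitted LinearRegression is just a line, so the same prediction can be rebuilt by hand from its coef_ and intercept_ attributes. The sketch below uses synthetic data as a stand-in for the houses dataset (the slope, intercept and noise level are made up), so the printed numbers will not match the ones above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the houses data (not the real dataset)
rng = np.random.default_rng(42)
area = rng.uniform(500, 3000, size=200)
price = 110 * area + 15_000 + rng.normal(0, 20_000, size=200)

model = LinearRegression().fit(area.reshape(-1, 1), price)

slope = model.coef_[0]        # price change per extra square foot
intercept = model.intercept_  # predicted price at an area of 0

# prediction = intercept + slope * area
manual_prediction = intercept + slope * 1000
sklearn_prediction = model.predict(np.array([[1000]]))[0]
print(manual_prediction, sklearn_prediction)
```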
❗ Note that your X (features) almost always needs to be a 2D array (rows = samples, columns = features) when passed as an argument to an Sklearn API method
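A short sketch of what this means in practice, using made-up numbers: a 1D NumPy array of areas has to be reshaped into a single column before fitting, and a single new sample is still written as one row with one column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

areas = np.array([800, 1000, 1500, 2000])            # shape (4,): 1D
prices = np.array([100_000, 125_000, 187_500, 250_000])

X = areas.reshape(-1, 1)                             # shape (4, 1): 2D, one column
model = LinearRegression().fit(X, prices)

# A single new sample is still 2D: one row, one column
single_prediction = model.predict(np.array([[1200]]))

# Passing the 1D array directly is rejected with a ValueError
try:
    LinearRegression().fit(areas, prices)
    raised = False
except ValueError:
    raised = True

print(X.shape, single_prediction, raised)
```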
Sklearn modeling flow
Import the model: from sklearn import model
Instantiate the model: model = model()
Train the model: model.fit(X, y)
Evaluate the model: model.score(new_X, new_y)
Make predictions: model.predict(new_X)
❓ What did we do wrong when scoring the model’s performance?
👉 We scored the model on the same data it was trained on!!
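The five steps of the flow, with the scoring done on rows the model has not seen, could look like the sketch below. The data is synthetic and the "new" rows are simply the last 50 of the generated set, which is an assumption made to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. import the model

rng = np.random.default_rng(1)
X = rng.uniform(500, 3000, size=(150, 1))
y = 120 * X[:, 0] + 10_000 + rng.normal(0, 15_000, size=150)

# Hypothetical "new" data the model never saw: the last 50 rows
X_seen, y_seen = X[:100], y[:100]
new_X, new_y = X[100:], y[100:]

model = LinearRegression()             # 2. instantiate the model
model.fit(X_seen, y_seen)              # 3. train the model
new_score = model.score(new_X, new_y)  # 4. evaluate (R² for a regressor)
predictions = model.predict(new_X)     # 5. make predictions

print(new_score, predictions[:3])
```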
4. Generalization
The performance of a Machine Learning model is evaluated on its ability to generalize when predicting unseen data.
The Holdout Method
The Holdout Method is used to evaluate a model’s ability to generalize. It consists of splitting the dataset into two sets:
Training set (~70%)
Testing set (~30%)
Example
Imagine our dataset has 9 observations
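The idea can be sketched with nothing but the standard library. The snippet below shuffles 9 observation indices and sets 3 of them aside for testing (3 out of 9 is the closest whole number to 30%); the seed is an arbitrary choice to make the example repeatable.

```python
import random

observations = list(range(9))  # 9 observations, identified by their index

random.seed(0)                 # arbitrary seed, only for repeatability
shuffled = observations[:]
random.shuffle(shuffled)

n_test = 3                     # roughly 30% of 9
test_set = shuffled[:n_test]   # held out, never used for training
train_set = shuffled[n_test:]  # used to fit the model

print(sorted(train_set), sorted(test_set))
```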
💻 train_test_split
Let’s model the SalePrice (y) according to the GrLivArea (X) whilst keeping generalization in mind.
from sklearn.model_selection import train_test_split

# Split the data into train and test
train_data, test_data = train_test_split(data, test_size=0.3)

# Ready X's and y's
X_train = train_data[['GrLivArea']]
y_train = train_data['SalePrice']
X_test = test_data[['GrLivArea']]
y_test = test_data['SalePrice']
You could also directly pass X and y to train_test_split.
# Ready X and y
X = data[['GrLivArea']]
y = data['SalePrice']

# Split into Train/Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Training and scoring
# Instantiate the model
model = LinearRegression()

# Train the model on the Training data
model.fit(X_train, y_train)

# Score the model on the Test data
LinearRegression()
model.score(X_test,y_test)
0.5066793488829242
❓ Can you think of any limitations of the Holdout Method?
Data split is random
Different random splits will create different results
### RUN THIS CELL MULTIPLE TIMES TO SEE DIFFERENT SCORES

# Split into Train/Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Instantiate the model
model = LinearRegression()

# Train the model on the Training data
model.fit(X_train, y_train)

# Score the model on the Test data
LinearRegression()
model.score(X_test,y_test)
0.47140651143812695
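The variation comes from the shuffle inside train_test_split, which can be pinned with its random_state parameter. The sketch below uses synthetic data and a hypothetical helper, holdout_score, to show that two different seeds give two different scores, while reusing a seed reproduces the same split and therefore the same score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the houses data (not the real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 110 * X[:, 0] + 15_000 + rng.normal(0, 50_000, size=200)


def holdout_score(seed):
    """Split with a fixed seed, train on the train set, score on the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    return LinearRegression().fit(X_train, y_train).score(X_test, y_test)


score_a = holdout_score(seed=1)
score_b = holdout_score(seed=2)
score_a_again = holdout_score(seed=1)

print(score_a, score_b, score_a_again)
```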
Loss of information
The data in the Test set is not used to train the model
If you have a small dataset, that loss could be significant
❓ How would you solve that issue?
👉 Average the scores of multiple holdout splits.
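A rough sketch of that idea, on synthetic data (the number of repetitions, 10, is an arbitrary assumption): repeat the holdout split with different seeds, collect the test scores, and report their mean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the houses data (not the real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 110 * X[:, 0] + 15_000 + rng.normal(0, 50_000, size=200)

scores = []
for seed in range(10):
    # A new random split on every iteration
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

mean_score = float(np.mean(scores))
print(min(scores), max(scores), mean_score)
```

The individual scores spread around the mean, which is a steadier estimate than any single split.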
K-Fold Cross Validation
The dataset is split into K folds
For each split, a sub-model is trained on K-1 folds and scored on the remaining fold
The average score of all sub-models is the cross-validated score of the model
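These steps can be written out by hand with Sklearn's KFold splitter. This is a sketch on synthetic data, with K=5 chosen arbitrarily, meant to show the mechanics rather than a recommended way of running cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the houses data (not the real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 110 * X[:, 0] + 15_000 + rng.normal(0, 50_000, size=200)

kfold = KFold(n_splits=5)  # K = 5 folds, 40 samples each here
fold_scores = []
fold_sizes = []

for train_idx, test_idx in kfold.split(X):
    # Train a sub-model on 4 folds, score it on the remaining fold
    sub_model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(sub_model.score(X[test_idx], y[test_idx]))
    fold_sizes.append(len(test_idx))

cross_validated_score = float(np.mean(fold_scores))
print(fold_scores, cross_validated_score)
```

Every sample is used for testing exactly once and for training K-1 times, so no data is left unused.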
Dataframe view
💻 cross_validate
from sklearn.model_selection import cross_validate

# Instantiate model
model = LinearRegression()

# 5-Fold Cross validate model
cv_results = cross_validate(model, X, y, cv=5)

# Scores
cv_results['test_score']

# Mean of scores